-
-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/issue 2799 robust no u turn #2800
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good. Do we have any hypothesis test-tests in place for this like we do for the RNGs? That's how the last sampler problem got found right?
EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(5)); | ||
EXPECT_EQ(8 * init_momentum, sampler.rho_values.at(6)); | ||
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(5)); | ||
EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(6)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be 14 more of these checks since there are now 3x the number of rho_values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one aggregate rho exposed in the build_tree
function. The others needed for the new tests are constructed from this aggregate rho on the fly internally and hence not accessible for unit testing. That internal construction could be repeated externally using the boundary momenta but then there's nothing to compare it to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. These values are being returned so why shouldn't they be tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to work:
EXPECT_EQ(7 * 3, sampler.rho_values.size());
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(0));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(1));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(2));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(3));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(4));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(5));
EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(6));
EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(7));
EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(8));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(9));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(10));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(11));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(12));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(13));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(14));
EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(15));
EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(16));
EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(17));
EXPECT_EQ(8 * init_momentum, sampler.rho_values.at(18));
EXPECT_EQ(5 * init_momentum, sampler.rho_values.at(19));
EXPECT_EQ(5 * init_momentum, sampler.rho_values.at(20));
If it makes sense, just add em' in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, I finally get what you were saying. I was thinking that there aren't any new rho values to check because the tree depth is the same but forgot how this code was instrumented to add a new rho values for every check, not every tree depth. I added the new checks split up into the values corresponding to the main check and the extra checks. Waiting to push in case any changes need to be made to the algorithm itself.
There's currently no validation of the samplers as explicit integration tests because there's not enough functionality to run everything from pure C++ (the main obstruction is tearing out the |
Why not use R dump or JSON format for var contexts? They're not in memory, but I/O performance shouldn't be an issue for testing.
… On Aug 18, 2019, at 5:21 PM, Michael Betancourt ***@***.***> wrote:
There's currently no validation of the samplers as explicit integration tests because there's not enough functionality to run everything from pure C++ (the main obstruction is tearing out the var_context code and replacing it with a clear data access layer and then writing C++ data access layer callbacks so that everything can be run in memory; I have designs that were long ago discussed and agreed upon but no one has had the time to implement it). In any case I did check this code against some IID models of varying dimension and a correlated Gaussian in which the last bug manifested (using 100000 iterations to be sufficient sensitive to small effects) and everything looked good.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@betanalpha oh okay, got it. Can you e-mail me the models for the test? I guess I should just manually double check that it's all working okay since it's not automated. @seantalts yo just at-ing you so you're aware of the state of tests. We should probably do something about this eventually. |
They’re all in benchmark repo, https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks.
The baselines for each expectation value are in https://github.com/stan-dev/stat_comp_benchmarks/tree/master/empirical_results/nuts.
You can run them all or just check the basics like low_dim_corr_gauss and
maybe a 50 or 100 dimensional IID gaussian,
parameters {
real x[100];
}
model {
x ~ normal(0, 1);
}
… On Aug 24, 2019, at 3:17 AM, Ben Bales ***@***.***> wrote:
@betanalpha oh okay, got it. Can you e-mail me the models for the test? I guess I should just manually double check that it's all working okay since it's not automated.
@seantalts yo just at-ing you so you're aware of the state of tests. We should probably do something about this eventually.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Got it thanks. I'll run these checks Tuesday or Wednesday (still travelin'). And the extra return values from build_tree? |
Thanks a lot for the discussion and the validation!! |
The test you were looking at checks just the rho aggregation. The edge momenta and sharp momenta are checked in other tests, including the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried out adding tests for the rho_values. See if you think they make sense.
I'm having trouble with this. I ran tests with the model here: https://github.com/stan-dev/stat_comp_benchmarks/blob/master/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan I ran a bunch of these models with different numbers of post warmup samples with the new and old code. I made plots for each parameter (and generated quantity) of mean +- 2 * the MCSE. The left column are the results with the new code and the right is with the old code. Different rows are different parameters. I plotted the true parameter value in red. The x axis is the number of post warmup draws. The things are all jittered -- this plot shows results run for 64000, 256000, 1024000, 4096000, and 16384000 iters. For large numbers of samples, the delta_var1 and delta_var2 are not zero. I don't know if I'm testing this correctly or if I'm doing the check wrong, but can you try to reproduce this? For reproducing, my test code looked something like this (I used the dense adaptation for the new and old codes):
Run this script for each code and dump the output in analysis.csv. Make the plots with something like:
|
That’s definitely not good.
Do you get the same result when you run with the diagonal metric?
Let me see if I can reproduce locally.
… On Aug 28, 2019, at 3:32 PM, Ben Bales ***@***.***> wrote:
I'm having trouble with this. I ran tests with the model here: https://github.com/stan-dev/stat_comp_benchmarks/blob/master/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan <https://github.com/stan-dev/stat_comp_benchmarks/blob/master/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan>
I ran a bunch of these models with different numbers of post warmup samples with the new and old code. I made plots for each parameter (and generated quantity) of mean +- 2 * the MCSE.
The left column are the results with the new code and the right is with the old code. Different rows are different parameters. I plotted the true parameter value in red. The x axis is the number of post warmup draws. The things are all jittered -- this plot shows results run for 64000, 256000, 1024000, 4096000, and 16384000 iters. For large numbers of samples, the delta_var1 and delta_var2 are not zero.
I don't know if I'm testing this correctly or if I'm doing the check wrong, but can you try to reproduce this?
<https://user-images.githubusercontent.com/4742424/63885819-29d42080-c9a7-11e9-8d6a-0450a12aa9a4.png>
For reproducing, my test code looked something like this (I used the dense adaptation for the new and old codes):
for i in 1000 4000 8000 16000 64000
do
for j in 8 9 10 11 12 13 14 15
do
./corr sample num_samples=$i algorithm=hmc metric=dense_e >& /dev/null
bin/stansummary --sig_figs=5 output.csv | grep "^z\|^delta_var" | tr -s " " "," | sed -e "s/^/$i,$j,/"
done
done
Run this script for each code and dump the output in analysis.csv. Make the plots with something like:
library(tidyverse)
library(ggplot2)
results_new = read_csv("/home/bbales2/cmdstan-uturn/analysis.csv", col_names = FALSE) %>%
mutate(which = "new")
results_orig = read_csv("/home/bbales2/cmdstan-current/analysis.csv", col_names = FALSE) %>%
mutate(which = "original")
bind_rows(results_new,
results_orig) %>%
mutate(iters = X1,
chain = X2,
name = X3,
mean = X4,
ub = mean + 2 * X5,
lb = mean - 2 * X5) %>%
filter(iters > 50000) %>%
ggplot(aes(iters, mean)) +
geom_hline(data = tibble(name = c("delta_var1", "delta_var2", "z[1]", "z[2]"),
y = c(0.0, 0.0, 0.0, 3.0)), aes(yintercept = y), color = "red") +
geom_point(aes(group = chain), position = position_dodge(width=0.5)) +
geom_errorbar(aes(group = chain, ymin = lb, ymax = ub), position=position_dodge(width=0.5)) +
scale_x_log10() +
facet_grid(name ~ which, scales = "free_y") +
theme(text = element_text(size=20))
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#2800?email_source=notifications&email_token=AALU3FWXBAMTVOJM4A5PMQTQG3HDRA5CNFSM4ILGSCZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5MHDAI#issuecomment-525889921>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AALU3FUOPIUEM3GF2XS54KDQG3HDRANCNFSM4ILGSCZQ>.
|
What is the Just to confirm -- the iteration lengths shown in your test code are different from those used to make the figure, right? I'm still not seeing any bias for higher numbers of iterations using the default settings (sample num_samples=4096000 random seed=838389).
|
Same for a dense metric,
|
Yeah j is just repeated runs. And yeah I go up to 16384000 in the figures. Try some repeated runs if you don't mind. It looks like it's messing up about half the time at 4096000 iters. I was a little concerned with the dense as well, but the plain Stan works with the dense metric. I've checked git diff and I don't have extra changes. I will repeat this calculation overnight, but I just pulled down a clean repo and the md5sum of the model built with this clean repo and the model I used for these calculations in the figure is the same. |
This is the script I'll run:
(edited to use random seed) |
I recommend fixing a seed and then using |
This is weeeeeeeeeird. For some seeds I do see the effect that you're seeing.
What's odd, however, it doesn't look there's some bias that manifests only once the standard error shrinks enough. Rather it looks like the standard error shrinks and then at some point a bias larger than the standard error suddenly appears. Then there's the fact that it's an inconsistent bias, and maybe correlated with slightly higher step sizes? I'm going to go through the code carefully to see if I can see anything. On the other hand I'm not sure any algorithm has been tested to one million effective samples so if the current version didn't behave as expected then I'd suspect a numerical issue with the estimators themselves. |
You mean things like the mean and quantile estimators?
Adding a million numbers will lose a lot of precision if they're all the same sign. For example, imagine 1e6 numbers between 0 and 1. You wind up adding numbers on the scale of 5e5 (the expectation) and 1 (the individual numbers being added), which loses more than 5 decimal places of precision (out of a total of 16 or so in double-precision floating point).
… On Aug 28, 2019, at 9:12 PM, Michael Betancourt ***@***.***> wrote:
This is weeeeeeeeeird. For some seeds I do see the effect that you're seeing.
1 chains: each with iter=(16384000); warmup=(0); thin=(1); 16384000 iterations saved.
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
z[1] 4.9e-04 3.4e-04 1.0e+00 -1.6 4.4e-04 1.6e+00 8.4e+06 1.2e+04 1.0e+00
z[2] 3.0e+00 7.3e-04 2.0e+00 -0.28 3.0e+00 6.3e+00 7.4e+06 1.0e+04 1.0e+00
delta_var1 -6.2e-03 4.7e-04 1.4e+00 -1.00 -5.5e-01 2.8e+00 8.9e+06 1.3e+04 1.0e+00
delta_var2 -2.4e-02 2.0e-03 5.6e+00 -4.0 -2.2e+00 1.1e+01 8.2e+06 1.2e+04 1.0e+00
delta_corr -2.3e-03 4.0e-04 1.1e+00 -1.1 -3.4e-01 2.2e+00 7.8e+06 1.1e+04 1.0e+00
What's odd, however, it doesn't look there's some bias that manifests only once the standard error shrinks enough. Rather it looks like the standard error shrinks and then at some point a bias larger than the standard error suddenly appears. Then there's the fact that it's an inconsistent bias, and maybe correlated with slightly higher step sizes?
I'm going to go through the code carefully to see if I can see anything. On the other hand I'm not sure any algorithm has been tested to one million effective samples so if the current version didn't behave as expected then I'd suspect a numerical issue with the estimators themselves.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Of course, but the issue is that there is no corresponding
bias in the baseline version of the sampler so it can’t be
something shared between the two versions, such as the
estimator code.
… On Aug 28, 2019, at 9:17 PM, Bob Carpenter ***@***.***> wrote:
You mean things like the mean and quantile estimators?
Adding a million numbers will lose a lot of precision if they're all the same sign. For example, imagine 1e6 numbers between 0 and 1. You wind up adding numbers on the scale of 5e5 (the expectation) and 1 (the individual numbers being added), which loses more than 5 decimal places of precision (out of a total of 16 or so in double-precision floating point).
> On Aug 28, 2019, at 9:12 PM, Michael Betancourt ***@***.***> wrote:
>
> This is weeeeeeeeeird. For some seeds I do see the effect that you're seeing.
>
> 1 chains: each with iter=(16384000); warmup=(0); thin=(1); 16384000 iterations saved.
>
> Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
> z[1] 4.9e-04 3.4e-04 1.0e+00 -1.6 4.4e-04 1.6e+00 8.4e+06 1.2e+04 1.0e+00
> z[2] 3.0e+00 7.3e-04 2.0e+00 -0.28 3.0e+00 6.3e+00 7.4e+06 1.0e+04 1.0e+00
> delta_var1 -6.2e-03 4.7e-04 1.4e+00 -1.00 -5.5e-01 2.8e+00 8.9e+06 1.3e+04 1.0e+00
> delta_var2 -2.4e-02 2.0e-03 5.6e+00 -4.0 -2.2e+00 1.1e+01 8.2e+06 1.2e+04 1.0e+00
> delta_corr -2.3e-03 4.0e-04 1.1e+00 -1.1 -3.4e-01 2.2e+00 7.8e+06 1.1e+04 1.0e+00
>
> What's odd, however, it doesn't look there's some bias that manifests only once the standard error shrinks enough. Rather it looks like the standard error shrinks and then at some point a bias larger than the standard error suddenly appears. Then there's the fact that it's an inconsistent bias, and maybe correlated with slightly higher step sizes?
>
> I'm going to go through the code carefully to see if I can see anything. On the other hand I'm not sure any algorithm has been tested to one million effective samples so if the current version didn't behave as expected then I'd suspect a numerical issue with the estimators themselves.
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub, or mute the thread.
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#2800?email_source=notifications&email_token=AALU3FVEVHTTX3CHJ2SJICLQG4PSXA5CNFSM4ILGSCZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5M5OAY#issuecomment-525981443>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AALU3FW4QKPRK3NBJYZ76EDQG4PSXANCNFSM4ILGSCZQ>.
|
I can't find anything that looks suspect in the code -- anyone want to take a second look? I've been poking around at edge cases and places where redundant calculations in the added checks might be vulnerable to insufficient precision introducing errors but I can't find anything inconsistent. |
I think I found it — there’s a bug in how the edge momenta and
sharp momenta are updated before a new expansion/proposal.
As in some of them aren’t updated as they should be, causing
the new checks to be slightly off (and getting increasingly off
as the tree depth increases). I’m testing the seeds that were
previously manifesting the bias now.
… On Aug 29, 2019, at 8:38 AM, Ben Bales ***@***.***> wrote:
Here's the results of running overnight.
<https://user-images.githubusercontent.com/4742424/63940195-01940280-ca37-11e9-9792-26b2b621aa99.png>
anyone want to take a second look?
I'll make a post in the discourse thread. I will eventually, but this looks like it could be a difficult problem to figure out.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#2800?email_source=notifications&email_token=AALU3FWP53OULXW7W6EP63LQG67KXA5CNFSM4ILGSCZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5OKNOQ#issuecomment-526165690>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AALU3FVN5USQE4ZQVT6NP5TQG67KXANCNFSM4ILGSCZQ>.
|
Yup, that did the trick.
goes to
I'll push once the tests are updated to the new behavior. |
Great thanks! |
Changes pushed -- I updated I don't think I've ever seen a sampler tested to this level of precision! Pretty awesome. |
Performance tests look good -- small differences but the general story is pretty much the same. I'll post to Discourse. |
On Aug 29, 2019, at 3:11 PM, Michael Betancourt ***@***.***> wrote:
Changes pushed -- I updated base_nuts_test with a new instrumented mock sampler that tests the edge momenta that weren't being updated correctly before. I'm going to rerun the performance tests to verify that hasn't been affected, but given that this only shows up at <Dr. Evil voice>one million iterations</Dr. Evil voice> I don't think there will be much of an affect.
I don't think I've ever seen a sampler tested to this level of precision! Pretty awesome.
Indeed. Ben's going to get a reputation for seriously heroic testing (after this and my autodiff test PR).
|
Don't jinx us. This looks good. @bob-carpenter can you merge? I'm not authorized. |
* [WIP] Robust U-turn check Following the recent discussion on the Stan side: stan-dev/stan#2800 For experiment, do not merge. * typo fix * bug fix * Additional U turn check only when depth > 1 (to avoid redundant work). * further logic to reduce redundant U Turn check. * bug fix fix error in recording the end point of the reversed subtree * [WIP] Robust U-turn check Following the recent discussion on the Stan side: stan-dev/stan#2800 For experiment, do not merge. * typo fix * bug fix * Additional U turn check only when depth > 1 (to avoid redundant work). * further logic to reduce redundant U Turn check. * bug fix fix error in recording the end point of the reversed subtree * Add release note.
Adds keyword argument to dynamic integration transitions and samplers to enable extra subtree termination criterion checks as described in stan-dev/stan#2800. Extra subtree checks are set to be enabled by default for the `DynamicMultinomialHMC` sampler.
Summary: Pull Request resolved: #864 An issue with the U-turn condition was discovered and discussed in [this post in Stan forum](https://discourse.mc-stan.org/t/nuts-misses-u-turns-runs-in-circles-until-max-treedepth/9727) TL;DR: we can make the U-turn condition more robust by introducing two additional checks across subtrees. This can help us avoid missing U-turns for approximately iid normal models. {F619223264} Since the tree combining code are almost identical in `_build_tree` and `propose`, I also take the chance to refactor them into a common function called `_combine_tree`. If you look closely you will notice that most part of `_combine_tree` are moved from existing code as-is. The only addition is the two additional call to `_is_u_turning` Related PR that implements this change: - Stan: stan-dev/stan#2800 - PyMC3: pymc-devs/pymc#3605 - Turing.jl: TuringLang/AdvancedHMC.jl#207 - DynamicHMC.jl: tpapp/DynamicHMC.jl#145 Reviewed By: neerajprad Differential Revision: D28735950 fbshipit-source-id: ada4ebcad26a87ef5e697f422b5c5b17007afe42
Submission Checklist
./runTests.py src/test/unit
make cpplint
Summary
Resolves #2799.
Intended Effect
Adds additional no-u-turn checks across subtrees to avoid missing u-turns for approximately iid normal models.
How to Verify
See https://discourse.mc-stan.org/t/nuts-misses-u-turns-runs-in-circles-until-max-treedepth/9727/36?u=betanalpha.
Side Effects
Decrease in antithetic behavior for component means in correlated models.
Documentation
Inline.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
Michael Betancourt
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: